Mining Parallel Texts from Mixed-Language Web Pages

Authors

Masao Utiyama

Daisuke Kawahara

Keiji Yasuda

Eiichiro Sumita

Abstract

We propose to mine parallel texts from mixed-language web pages. We define a mixed-language web page as a web page consisting of (at least) two languages. We mined Japanese-English parallel texts from mixed-language web pages. We presented the statistics for the extracted parallel texts and conducted machine translation experiments. These statistics and experiments showed that mixed-language web pages are rich sources of parallel texts.

Similar resources

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Exploiting the Web as Parallel Corpora for Cross-Language Information Retrieval

The expansion of the Web creates more requirements for Cross-Language Information Retrieval (CLIR). Query translation is the key problem. Previous studies have shown that query translation can be done by exploiting a large set of parallel texts. However, the problem that arises is the unavailability of large parallel corpora for many languages. In this paper, we describe a mining system that automat...

Parallel Sentences Mining From The Web

Parallel sentences can benefit many NLP applications (e.g., machine translation and cross-language information retrieval). In this paper, candidate bilingual web pages are returned by submitting sentence pairs to a search engine and are then validated by surface patterns. We propose an algorithm to extract candidate bilingual resources and filter useless bilingual web pages. The pair sentences includ...

Babylon Parallel Text Builder: Gathering Parallel Texts for Low-Density Languages

This paper describes BABYLON, a system that attempts to overcome the shortage of parallel texts in low-density languages by supplementing existing parallel texts with texts gathered automatically from the Web. In addition to the identification of entire Web pages, we also propose a new feature specifically designed to find parallel text chunks within a single document. Experiments carried out o...

Automatic construction of parallel English-Chinese corpus for cross-language information retrieval

A major obstacle to the construction of a probabilistic translation model is the lack of large parallel corpora. In this paper we first describe a parallel text mining system that finds parallel texts automatically on the Web. The generated Chinese-English parallel corpus is used to train a probabilistic translation model which translates queries for Chinese-English cross-language information r...

Publication date: 2009
